## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
The summary below shows how the values in each column are distributed across the wines.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Three things are clear from this table. First, quality ratings vary only from 3 to 8. Second, density varies very little, so it may have little effect on quality ratings. Third, we have only 1,599 wines, which is not a large number for this analysis.
Most wines are rated 5 or 6. Far fewer wines have ratings below 5 or above 6. Quality ratings are distributed roughly symmetrically between 3 and 8.
The fixed acidity distribution is right skewed with a peak around 7. Values range from 4.6 to 15.9.
The volatile acidity distribution is right skewed with a peak around 0.6. Values range from 0.12 to 1.58.
Citric acid values vary between 0 and 1. The distribution resembles a right-skewed shape, with the highest counts at 0 and around 0.5.
Residual sugar ranges from 0.9 all the way to 15.5. However, most values are in the range 1.9 to 2.6. This distribution is right skewed.
The chlorides distribution is right skewed. Most values are around 0.079, and the range runs from 0.012 to 0.611.
The free sulfur dioxide distribution is right skewed. The most common values are around 5, and the range runs from 1 to 72.
Total sulfur dioxide is also right skewed. Its values vary a lot, from 6 to 289, but most are low, between 6 and 70.
The density distribution is symmetrical. However, wine density does not vary much.
pH values stay in the acidic range, varying from 2.74 to 4.01. The values are distributed symmetrically.
Sulphate values are distributed in a right-skewed shape. Values range from 0.33 to 2. The middle half of the values lies between 0.55 and 0.73.
Alcohol values also appear to be distributed in a right-skewed shape. Values range from 8.40 to 14.90, and most appear to lie between 9 and 10.
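The skew calls above were read from histograms. A rough numeric cross-check, sketched below, compares a column's mean with its median: a mean noticeably above the median points to a right skew (the summary table shows this for residual sugar, with mean 2.539 against median 2.200). The helper name `skew_direction`, the 0.05 tolerance, and the toy vectors are assumptions made for illustration; they are not part of the original analysis.

```r
# Hypothetical helper: classify skew by the gap between mean and median,
# scaled by the standard deviation. The tolerance is an arbitrary assumption.
skew_direction <- function(x, tol = 0.05) {
  gap <- (mean(x) - median(x)) / sd(x)
  if (gap > tol) {
    "right"
  } else if (gap < -tol) {
    "left"
  } else {
    "roughly symmetric"
  }
}

# Toy vectors standing in for real columns (not values from the wine data).
long_right_tail <- c(1, 2, 2, 3, 3, 3, 4, 10, 15)
balanced <- c(1, 2, 3, 4, 5)

right_result <- skew_direction(long_right_tail)
balanced_result <- skew_direction(balanced)
print(c(right_result, balanced_result))
```

On the real data this could be applied to every column with `sapply(pf, skew_direction)`, assuming `pf` contains only numeric columns.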
This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the Portuguese “Vinho Verde” red wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). Using these ratings, we will try to find which properties make a wine receive the highest ratings. Here we can see all columns.
Chemical features of wine.
None of the attributes stands out. However, alcohol, fixed acidity, volatile acidity, citric acid, and free and total sulfur dioxide interest me because these attributes vary over a wider range.
I haven’t created any new variables yet.
We checked for missing values; none were found.
Because nothing stands out, we will plot every variable against the quality attribute.
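A missing-value check like the one described can be sketched in base R as follows. The small `pf_demo` data frame is a made-up stand-in with one deliberate `NA` so the check has something to report; on the real data the same calls would be run on `pf`.

```r
# Toy stand-in for the wine data frame (values are illustrative only).
pf_demo <- data.frame(
  alcohol = c(9.4, NA, 10.2),
  quality = c(5L, 6L, 5L)
)

# Count missing values per column, then test whether any exist at all.
missing_per_column <- colSums(is.na(pf_demo))
any_missing <- anyNA(pf_demo)

print(missing_per_column)
print(any_missing)
```

A result of all zeros from `colSums(is.na(pf))` would match the statement that no missing values were found.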
As we can see, volatile acidity tends to be lower as the quality rating goes higher.
Mean and median values are almost the same across all quality ratings.
Citric acid tends to be higher at higher quality ratings.
Residual sugar stays almost the same across the quality ratings.
Chlorides also stay the same.
Density decreases slightly; however, its values vary only within a very small range.
Total sulfur dioxide remains almost the same except for a slight peak at a rating of 5.
pH tends to be slightly more acidic.
Sulphates increase slightly in wines with higher quality ratings.
The mean and median values of alcohol are almost the same for ratings 3 to 5. In wines with higher quality ratings, however, alcohol values tend to be higher.
Most median and mean values of the attributes don’t vary much or stay the same across all ratings. However, alcohol, volatile.acidity, citric.acid, and sulphates stand out from the other attributes by changing the most across quality ratings.
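The per-rating means and medians discussed above can be computed with `tapply`. This is a minimal sketch on a made-up `pf_demo` data frame; the values are illustrative and are not taken from the wine data.

```r
# Toy stand-in: two wines per quality rating (illustrative values only).
pf_demo <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7),
  alcohol = c(9.4, 9.8, 10.0, 10.6, 11.2, 12.0)
)

# Mean and median alcohol within each quality rating.
mean_by_quality <- tapply(pf_demo$alcohol, pf_demo$quality, mean)
median_by_quality <- tapply(pf_demo$alcohol, pf_demo$quality, median)

print(rbind(mean = mean_by_quality, median = median_by_quality))
```

Repeating this for each chemical column shows which attributes shift with the rating and which stay flat.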
Alcohol has the strongest relationship.
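One way to compare the strength of these relationships is to rank the attributes by the absolute value of their Pearson correlation with quality. The sketch below assumes that "strongest relationship" means strongest linear correlation, which the text does not state explicitly, and it uses a made-up `pf_demo` data frame in place of the real data.

```r
# Toy stand-in (illustrative values only, not from the wine data).
pf_demo <- data.frame(
  quality = c(3, 4, 5, 6, 7, 8),
  alcohol = c(9.0, 9.5, 10.0, 10.5, 11.0, 11.5),
  chlorides = c(0.08, 0.09, 0.07, 0.08, 0.09, 0.07)
)

# Correlate every non-quality column with quality.
predictors <- setdiff(names(pf_demo), "quality")
cors <- sapply(pf_demo[predictors], cor, y = pf_demo$quality)

# Rank by absolute correlation, strongest first.
ranked <- sort(abs(cors), decreasing = TRUE)
print(ranked)
```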
We will group the quality ratings into three groups:
- Low rating group (ratings in range 3 to 5)
- Middle rating group (ratings in range 5 to 6)
- High rating group (ratings in range 6 to 8)
The middle rating group is “narrow” because most wines get ratings between 5 and 6.
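The three ranges above share the boundaries 5 and 6, so a wine rated 5 or 6 could belong to two groups. One possible reading, sketched here, is that 3, 5, 6, and 8 are cut points passed to `cut`, which assigns each rating to exactly one group. The labels and the choice of `include.lowest = TRUE` are assumptions; the original grouping code is not shown in the report.

```r
# Every rating that occurs in the data, according to the summary table.
quality <- c(3, 4, 5, 6, 7, 8)

# Right-closed intervals: [3,5] low, (5,6] middle, (6,8] high.
rating_group <- cut(
  quality,
  breaks = c(3, 5, 6, 8),
  labels = c("low", "middle", "high"),
  include.lowest = TRUE
)

print(data.frame(quality, rating_group))
```

Under this reading the middle group contains only rating 6, and rating 5 falls into the low group.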
As we can see, the green dots are shifted to the right compared with the yellow ones. This suggests that wines with a higher quality rating tend to have a higher alcohol content. Higher quality wines also tend to have slightly higher sulphate levels.
This graph indicates that the lower rating group tends to have higher volatile acidity and lower alcohol content compared with higher-rated wines.
Here we can see that “lower quality” wines tend to have both low and high values of citric acid. However, “higher quality” wines tend to have slightly more values at the higher end of citric acid.
##
## Calls:
## m1: lm(formula = I(quality) ~ I(alcohol), data = pf)
## m2: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity, data = pf)
## m3: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + citric.acid,
## data = pf)
## m4: lm(formula = I(quality) ~ I(alcohol) + volatile.acidity + citric.acid +
## sulphates, data = pf)
##
## ============================================================================
## m1 m2 m3 m4
## ----------------------------------------------------------------------------
## (Intercept) 1.875*** 3.095*** 3.055*** 2.646***
## (0.175) (0.184) (0.194) (0.201)
## I(alcohol) 0.361*** 0.314*** 0.314*** 0.309***
## (0.017) (0.016) (0.016) (0.016)
## volatile.acidity -1.384*** -1.343*** -1.265***
## (0.095) (0.114) (0.113)
## citric.acid 0.068 -0.079
## (0.103) (0.104)
## sulphates 0.696***
## (0.103)
## ----------------------------------------------------------------------------
## R-squared 0.227 0.317 0.317 0.336
## adj. R-squared 0.226 0.316 0.316 0.334
## sigma 0.710 0.668 0.668 0.659
## F 468.267 370.379 246.976 201.777
## p 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1621.814 -1621.596 -1599.093
## Deviance 805.870 711.796 711.603 691.852
## AIC 3448.114 3251.628 3253.192 3210.186
## BIC 3464.245 3273.136 3280.078 3242.448
## N 1599 1599 1599 1599
## ============================================================================
The features strengthen each other, but not by much. Alcohol and citric acid have a positive relationship with quality, whereas volatile.acidity has a negative one.
The interesting finding is how small an effect each of these attributes has. Another surprising thing: alcohol appears to have the strongest correlation with quality compared with the other attributes.
Yes, we fitted linear regression models with the variables that appear to have the most effect. However, the results show that these variables have only a modest effect on wine ratings: the largest model (m4) reaches an R-squared of just 0.336.
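The nested models in the table above can be fitted with base R's `lm`. The sketch below uses a made-up `pf_demo` data frame and only the first two models, so its numbers will not match the table. It also drops the `I()` wrappers from the original formulas, since wrapping a plain variable in `I()` should not change the fit.

```r
# Toy stand-in for the wine data (illustrative values only).
pf_demo <- data.frame(
  alcohol = c(9.4, 9.8, 10.0, 10.5, 11.0, 11.4, 12.0, 12.5),
  volatile.acidity = c(0.70, 0.88, 0.60, 0.52, 0.45, 0.40, 0.36, 0.30),
  quality = c(5, 5, 5, 6, 6, 7, 7, 8)
)

# Nested models: each one adds a predictor to the previous one.
m1 <- lm(quality ~ alcohol, data = pf_demo)
m2 <- lm(quality ~ alcohol + volatile.acidity, data = pf_demo)

# R-squared never decreases when a predictor is added to a nested model,
# so the size of the increase is what matters.
r2 <- c(m1 = summary(m1)$r.squared, m2 = summary(m2)$r.squared)
print(r2)

# F-test for whether the added predictor improves the fit.
print(anova(m1, m2))
```

Continuing the pattern with `citric.acid` and `sulphates` would give the counterparts of m3 and m4.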
As we can see, ratings follow an almost perfectly normal distribution. Even though ratings can range from 0 to 10, in the real world they vary only from 3 to 8. There are also no fractional ratings, because wines are rated in integer values.
Alcohol appears to have the biggest effect on the overall rating. However, the mean and median alcohol values remain almost stable across quality ratings 3 to 5. From a rating of 5 upward, the alcohol mean and median tend to rise. At higher quality ratings, the lowest alcohol values tend to increase, whereas the highest alcohol values remain the same across all ratings of 5 or higher.
In the last chart we compare the quality index with quality ratings. In a perfect world, with every higher quality range (ratings 3-5, 5-6, 6-8), all wines would sit higher in the graph due to a higher quality index. However, this is not the case, and perhaps that is why the correlation between wine attributes and quality ratings is low. As we can see, the lowest quality index values are the same across the low, middle, and high rating groups. The highest quality index values differ slightly: the low rating group tends to have lower values, whereas middle and lower-high rating wines tend to have the highest quality index values. Interestingly, the highest-rated wines tend to have lower values, almost the same as those at quality rating 4.
This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent). During this analysis I looked at histograms of every variable to get a sense of how the values are distributed and how widely they fluctuate. Then I compared every chemical variable with the quality rating. The goal was to identify patterns that might affect rating values. Once the variables were chosen, I checked how much effect they have in a linear model and made plots using the quality rating groups.
Before starting this analysis I was hoping to find clear patterns of some variables affecting quality ratings. After seeing some histograms it was clear that some variables, such as density and pH, fluctuate very little. They are also byproducts of other variables: density changes when the chemistry of the fluid differs, and pH changes with the amounts of acidic or basic components, and many of the other variables are either acidic or basic. When comparing the quality rating with every variable, there wasn’t any clear tendency. The most promising variables were alcohol, volatile acidity, citric acid, and sulphates. However, when analyzing further with quality rating categories and comparing the quality rating with the quality index (calculated from the chosen variables), the patterns suggesting that the chosen variables affect quality ratings were minimal, if any. Lastly, the linear model supported my findings: correlation was low. The findings were disappointing since I was expecting different results.
Like any analysis, this one has flaws that might lead to wrong results. One likely reason: because the wines were evaluated by humans, ratings could be influenced by subjective feelings and tastes, which this data does not take into account. The psychological state of the tasters during testing can also make ratings subjective.
In general, I’ve found that this data analysis expanded my view of wines. It also sharpened my data analysis skills and my R programming skills. Although it wasn’t easy, it was worth it.